Lin Hua (林华) | The Legal Competition over AI Data Training
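Contents
1 If Data Is King
2 Basic Copyright Rules for AI Data Training
2.1 The Copyright Law
2.2 The Interim Measures for the Management of Generative Artificial Intelligence Services
3 Copyright Analysis of the Data Input Stage
3.1 Why Break the Conduct into Stages
3.1.1 Deconstructing in Order to Analyze the Whole Better
3.1.2 Data Input and Result Output Are Independent of Each Other
3.1.3 Applying Legal Rules Separately
3.2 Chinese Law on the Data Input Stage
3.2.1 The Copyright Law Does Not Require an "Individual" to Be a Natural Person
3.2.2 "Individual" as Used in the Copyright Law Leaves Ample Room for Interpretation
3.2.3 Applying Rules by Analogy to Similar Situations
3.2.4 Practice Calls for Expansive Interpretation or Application by Analogy
3.3 How Fair Use Legislative Models Affect AI Data Training
4 Holistic Provisions on Copyright in Data Training
4.1 US Copyright Law
4.1.1 Section 107 of the Copyright Act
4.1.2 Expert Opinion on Data Training and Fair Use
4.2 EU Legislation
4.2.1 Structure of EU Legislation
4.2.2 The DSM Directive and the TDM Rules
4.3 UK Legislation
4.4 Japanese Legislation
4.5 Korean Legislation
4.6 Israeli Legislation
5 Fair Use or Exceptions to Copyright Protection
5.1 China's Position and International Treaty Obligations
5.2 Other Overseas Legislation
5.2.1 Excluding Protection for Non-Expressive Elements
5.2.2 Indirectly Permitting Use
6 Case Analysis and Lessons
6.1 Chinese Case Analysis
6.1.1 Commercial Practice of Using Elements of Others' Works
6.1.2 Infringement Cases Involving Elements of Others' Works
6.1.3 A Classic Copyright Fair Use Case: 听音识剧 (Identifying a Show from Its Audio)
6.1.4 Summary of Views
6.2 US Case Analysis 1: Andersen et al. v. Stability AI et al.
6.3 US Case Analysis 2: Getty Images (US) v. Stability AI
6.3.1 The Trademark Claims and Evidence
6.3.2 A Technical Argument: Did Stable Diffusion Infringe Intentionally?
7 Fair Use for AI Data Training Is a Global Legal Competition
7.1 Times Have Changed, My Lord
7.2 The Legal Competition Surrounding Artificial Intelligence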
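1 If Data Is King
2 Basic Copyright Rules for AI Data Training
2.1 The Copyright Law
2.2 The Interim Measures for the Management of Generative Artificial Intelligence Services
3 Copyright Analysis of the Data Input Stage
3.1 Why Break the Conduct into Stages
3.2 Chinese Law on the Data Input Stage
3.3 How Fair Use Legislative Models Affect AI Data Training
4 Holistic Provisions on Copyright in Data Training
4.1 US Copyright Law
4.2 EU Legislation
1. Structure of EU Legislation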
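It is worth first briefly outlining the structure of EU legislation relevant to AI data training. Formally, EU legislation on AI consists of two instruments: the Artificial Intelligence Act (also called the AI Act), which has entered its final stage, and the directive on copyright and related rights in the Digital Single Market issued in 2019 (Directive (EU) 2019/790 on copyright in the Digital Single Market, "DSM" for short).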
2. The DSM Directive and the TDM Rules
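(18) In addition to their significance in the context of scientific research, text and data mining techniques are widely used both by private and public entities to analyse large amounts of data in different areas of life and for various purposes, including for government services, complex business decisions and the development of new applications or technologies. … In order to provide for more legal certainty in such cases and to encourage innovation also in the private sector, this Directive should provide, under certain conditions, for an exception or limitation for reproductions and extractions of works or other subject matter, for the purposes of text and data mining (author's note: that is, fair use).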
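The full text of recital 18 adds that, for content made publicly available online, right holders are expected to reserve their text and data mining rights by machine-readable means, including metadata and the terms and conditions of a website or service. As a rough illustration only, the sketch below assumes that a site expresses such a reservation through robots.txt rules and that the miner's crawler is called `ExampleTDMBot`; both the crawler name and the choice of robots.txt as the signal are hypothetical. Whether any particular signal counts as a legally valid reservation is a legal question the sketch does not answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the site blocks one named mining crawler
# and leaves everything open to all other agents.
ROBOTS_TXT = """\
User-agent: ExampleTDMBot
Disallow: /

User-agent: *
Allow: /
"""


def may_mine(robots_txt: str, crawler_name: str, url: str) -> bool:
    """Return True if the robots.txt rules do not block `crawler_name` from `url`.

    This only reads the machine-readable signal; it says nothing about
    lawful access or about other reservations (metadata, site terms).
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(crawler_name, url)


blocked = may_mine(ROBOTS_TXT, "ExampleTDMBot", "https://example.com/photos/1.jpg")
open_to_others = may_mine(ROBOTS_TXT, "SomeOtherBot", "https://example.com/photos/1.jpg")
```

A crawler built this way would skip any URL for which `may_mine` returns `False`, treating the rule as a reservation of rights rather than as a mere technical preference.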
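4.3 UK Legislation
4.4 Japanese Legislation
4.5 Korean Legislation
4.6 Israeli Legislation
5 Fair Use or Exceptions to Copyright Protection
5.1 China's Position and International Treaty Obligations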
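As discussed in Part 2 of this article, Articles 4 and 7 of the Interim Measures for the Management of Generative Artificial Intelligence Services, recently issued by the Cyberspace Administration of China and other agencies, read on their own, already close off any possibility of applying fair use to AI data training, from data input through to result output. Yet even if the Interim Measures exclude fair use, two possible ways remain to reopen that door.
The first is legislation, or interpretation of the Copyright Law. As argued above, interpretation in the course of applying the law can at least resolve the fair use question for scientific research that relies on AI data training. The second is to find a basis outside the fair use regime that supports using copyrighted works for data training. The provisions most likely to do this work are the exceptions to copyright protection.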
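International treaties to which China is a party bind China even where domestic law does not expressly restate them. Article 9 of the Agreement on Trade-Related Aspects of Intellectual Property Rights (the WTO intellectual property agreement, TRIPS), to which China is a party, is headed "Relation to the Berne Convention" and contains the following two paragraphs:
1. Members shall comply with Articles 1 through 21 of the Berne Convention (1971) and the Appendix thereto. However, Members shall not have rights or obligations under this Agreement in respect of the rights conferred under Article 6bis of that Convention or of the rights derived therefrom.
2. Copyright protection shall extend to expressions and not to ideas, procedures, methods of operation or mathematical concepts as such.
In addition, Article 6 of China's Regulations on the Protection of Computer Software likewise provides that the protection the Regulations give to software copyright does not extend to the ideas, processes, methods of operation, mathematical concepts and the like used in developing software.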
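On this article's view, and on the technical and other arguments in Part 6, at least for generative AI, whether the training data is text or images, the generated output uses only the ideas, concepts, techniques and styles in the training data (which Professor Sag usually calls ideas and facts, or non-expressive elements), not the expression of the works. In other words, what generative AI uses is the part of the training material from which expression can be generated, such as ideas and style.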
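5.2 Other Overseas Legislation
1. Excluding Protection for Non-Expressive Elements
2. Indirectly Permitting Use
6 Case Analysis and Lessons
6.1 Chinese Case Analysis
1. Commercial Practice of Using Elements of Others' Works
2. Infringement Cases Involving Elements of Others' Works
3. A Classic Copyright Fair Use Case: 听音识剧 (Identifying a Show from Its Audio)
4. Summary of Views
6.2 US Case Analysis 1: Andersen et al. v. Stability AI et al.
6.3 US Case Analysis 2: Getty Images (US) v. Stability AI
1. The Trademark Claims and Evidence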
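That generative AI does not need to rely on copying to learn images is by now an accepted technical principle. In practice, AI training requires an astronomical number of training images, so it is no surprise that Stable Diffusion made use of Getty images. But if, after training on a vast number of images, the AI still mistook the Getty watermark for a necessary background of images in general, that would defy common sense.
Data already used in training may not meet a specific need. For example, because general training mostly uses images of European, American and Korean women, generating accurate images of Tibetan women requires additional, dedicated training material. Users therefore need to be able to carry out targeted training on specialized images on top of the large model. Besides providing technical support for training on general material, Stable Diffusion also lets users build their own LoRA database of targeted training material.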
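LoRA usually refers to low-rank adaptation, a fine-tuning technique in which the large model's weights stay frozen and only a small pair of extra matrices is trained on the targeted material. The sketch below is a toy illustration in plain Python, not Stable Diffusion's actual code; every name and number in it is an assumption made for the example.

```python
# Toy LoRA-style update: effective weight = frozen base + scale * (up @ down).
# All matrices and helper names are illustrative assumptions.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(base, down, up, scale=1.0):
    """Return base + scale * (up @ down) without modifying `base`.

    base: d_out x d_in weight of the large model, kept frozen
    down: r x d_in     small matrix trained on the targeted material
    up:   d_out x r    small matrix trained on the targeted material
    """
    delta = matmul(up, down)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base, delta)]


base = [[1.0, 0.0, 2.0],
        [0.0, 1.0, 0.0],
        [3.0, 0.0, 1.0]]      # 3x3 frozen base weight (9 numbers)
down = [[0.5, 0.0, -0.5]]     # rank r = 1, shape 1x3
up = [[1.0], [0.0], [2.0]]    # shape 3x1

adapted = lora_effective_weight(base, down, up)
```

Here the user-trained part is 6 numbers next to a base of 9; in a real model the gap is far larger, which is why targeted training of this kind can be done by individual users on top of a shared base model.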
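Readers will probably recall the quality of an outstanding Stable Diffusion image. In portrait lighting and in the rendering of hair, for example, the AI can be remarkably good.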
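1) Technical notes on the complaint: generative training technology
2) Technical notes on the complaint: imagined complications
7 Fair Use for AI Data Training Is a Global Legal Competition
7.1 Times Have Changed, My Lord
7.2 The Legal Competition Surrounding Artificial Intelligence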
Notes
[i] Guan Yuying (管育鹰), "Constructing an Infringement Exception Rule for Text and Data Mining in China's Copyright Legislation, with a Discussion of the CNKI Plagiarism-Check Dispute" (《我国版权立法中文本数据挖掘侵权例外规则的构建——兼论中国知网论文查重争议》), http://www.fxcxw.org.cn/dyna/content.php?id=25175
[ii] <KOREAN COPYRIGHT ACT> Article 35-2 (Temporary Reproduction in Course of Using Works, etc.): Where a person uses works, etc. on a computer, he or she may temporarily reproduce such works, etc. in that computer to the extent deemed necessary for the purpose of smooth and efficient information processing: Provided, that this shall not apply where the use of such works, etc. infringes on copyright.
[iii]<Copyright Law of Japan>,https://www.cric.or.jp/english/clj/cl2.html
[iv] 17 U.S.C. § 107: (1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes; (2) the nature of the copyrighted work; (3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and (4) the effect of the use upon the potential market for or value of the copyrighted work. The fact that a work is unpublished shall not itself bar a finding of fair use if such finding is made upon consideration of all the above factors.
[v]https://www.judiciary.senate.gov/download/2023-07-12-pm-testimony-sag
[vi]Training generative AI on copyrighted works is usually fair use because it falls into the category of non-expressive.
Courts addressing technologies, such as reverse engineering, search engines, and plagiarism detection software, have held that these “non-expressive uses” are fair use. These cases reflect copyright’s fundamental distinction between protectable original expression, and unprotectable facts, ideas, abstractions, and functional elements.
Whether training an LLM is a non-expressive use depends on the outputs of the model. If an LLM is trained properly and operated with appropriate safeguards, its outputs will not resemble its inputs in a way that would trigger copyright liability. Training such an LLM on copyrighted works would thus be justified under the fair use doctrine.
[vii]<Parliament's negotiating position on the artificial intelligence act>,https://www.europarl.europa.eu/RegData/etudes/ATAG/2023/747926/EPRS_ATA(2023)747926_EN.pdf
[viii](18) In addition to their significance in the context of scientific research, text and data mining techniques are widely used both by private and public entities to analyse large amounts of data in different areas of life and for various purposes, including for government services, complex business decisions and the development of new applications or technologies. ……In order to provide for more legal certainty in such cases and to encourage innovation also in the private sector, this Directive should provide, under certain conditions, for an exception or limitation for reproductions and extractions of works or other subject matter, for the purposes of text and data mining, and allow the copies made to be retained for as long as is necessary for those text and data mining purposes.
This exception or limitation should only apply where the work or other subject matter is accessed lawfully by the beneficiary, including when it has been made available to the public online, and insofar as the right holders have not reserved in an appropriate manner the rights to make reproductions and extractions for text and data mining. In the case of content that has been made publicly available online, it should only be considered appropriate to reserve those rights by the use of machine-readable means, including metadata and terms and conditions of a website or a service.
[ix]Copyright, Designs and Patents Act 1988, Section 29A.
Copies for text and data analysis for non-commercial research
(1)The making of a copy of a work by a person who has lawful access to the work does not infringe copyright in the work provided that—
(a)the copy is made in order that a person who has lawful access to the work may carry out a computational analysis of anything recorded in the work for the sole purpose of research for a non-commercial purpose, and
(b)the copy is accompanied by a sufficient acknowledgement (unless this would be impossible for reasons of practicality or otherwise).
[x]<Artificial Intelligence and Intellectual Property: copyright and patents: Government response to consultation>,Conclusion
58. The Government has decided to introduce a new copyright and database right exception which allows TDM for any purpose. The Government will identify suitable legislation to make the required changes in due course.
59. Introducing an exception which applies to commercial TDM will bring benefits to a wide range of stakeholders in the UK. These include researchers, AI developers, small businesses, cultural heritage institutions, journalists, and engaged citizens. Targeted products and services will benefit businesses and customers. Research outcomes could also benefit the wider public. This could be, for example, by supporting research and innovation in public health. Some in the creative industries also use TDM and AI to understand their market or create new works – they will also see benefits. The benefits will be reducing the time needed to obtain permission from multiple rights holders and no license fee to pay. This will speed up the TDM process and development of AI.
https://www.gov.uk/government/consultations/artificial-intelligence-and-ip-copyright-and-patents/outcome/artificial-intelligence-and-intellectual-property-copyright-and-patents-government-response-to-consultation
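[xi] 高嘉鸿, "Overview of the Rights-Limitation Provisions in Japan's 2018 Copyright Act Amendment" (《日本2018年著作权法修正权利限制规定概要》), 智慧财产权月刊 (Intellectual Property Monthly), 108.5, VOL.245
[xii] "No Need to Worry about Copyright in AI Training Data? Japanese Government's Statement Sparks Heated Debate" (《AI训练数据不用担心版权问题?日本政府表态引发热议》), see https://new.qq.com/rain/a/20230602A09RL000?no-redirect=1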
[xiii]<Korean Copyright Act>,https://elaw.klri.re.kr/eng_service/lawView.do?hseq=42726&lang=ENG
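[xiv] "Opinion of the Israeli Ministry of Justice on the Use of Copyrighted Content for Machine Learning" (《以色列司法部对受版权保护的内容用于机器学习的意见》), https://www.gov.il/BlobFolder/legalinfo/machine-learning/he/machine-learning.pdf
[xv] "Israeli Ministry of Justice Issues Opinion Supporting the Use of Copyrighted Works for Machine Learning" (《以色列司法部发布意见书 支持将版权作品用于机器学习》), 中国保护知识产权网 (China IPR protection website), http://ipr.mofcom.gov.cn/article/gjxw/gbhj/yzqt/ysl/202302/1976280.html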
[xvi]<Parliament's negotiating position on the artificial intelligence act>,https://www.europarl.europa.eu/RegData/etudes/ATAG/2023/747926/EPRS_ATA(2023)747926_EN.pdf
[xvii] For example, "An Explanation of How Stable Diffusion Works" (《Stable Diffusion原理解读》), https://zhuanlan.zhihu.com/p/583124756
[xviii]<Scraping/Mining Public-Facing Information for Generative AI>,(https://www.dropbox.com/scl/fi/ecvs981dx42caujdgxln5/Matthew-Sag-ABA-Scraping-Webinar-Slides.pptx?rlkey=y9zbunityohvyenlku686h640&dl=0)
[xix] Nanshan District Court, Shenzhen, (2019) 粤0305民初14010号
[xx] The complaint filed by Sarah Andersen and other artists is highly informative: https://stablediffusionlitigation.com/pdf/00201/1-1-stable-diffusion-complaint.pdf
[xxi]< US judge finds flaws in artists' lawsuit against AI companies >
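[xxii] The Getty complaint is worth reading; see https://stablediffusionlitigation.com/pdf/00201/1-1-stable-diffusion-complaint.pdf
[xxiii] "AIGC Commercialization: Who Is Responsible for Copyright Protection?" (《AIGC商业化,版权保护谁来管?》), https://mp.weixin.qq.com/s/_SAREyljb99vSbbbKO_DnA
[xxiv] 2022 UK Government response to consultation, No. 34.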